-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add skipmissings #111
Add skipmissings #111
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pushing this! Can you add a few tests?
I wonder whether this couldn't be called skipmissings
. That would be simpler and easier to discover. what do you think?
EDIT: could you run some benchmarks, e.g. comparing with iterating over the input vectors when there are no missing values?
src/Missings.jl
Outdated
y = iterate(itr.c, ntuple(_ -> first(state), width(itr))) | ||
y === nothing && return nothing | ||
item, state = y | ||
while any(ismissing, item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can avoid repeating the iterate
stuff by doing while true
with any(ismissing, item) || break
.
Also I suspect that for performance any(x -> x ===missing, item)
will be better due to inference limitations. And since that's what skipmissing
does, better be consistent anyway.
src/Missings.jl
Outdated
``` | ||
""" | ||
function multiskipmissing(args...) | ||
return ((MultiSkipMissing(args[i], zip(args...), i) for i in 1:length(args))...,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ((MultiSkipMissing(args[i], zip(args...), i) for i in 1:length(args))...,) | |
ntuple(i -> MultiSkipMissing(args[i], zip(args...), i), length(args)) |
Could also use the short function syntax since that's a one-liner..
src/Missings.jl
Outdated
# Examples | ||
``` | ||
julia> x = [1, 2, missing, 4]; y = [1, 2, 3, missing]; | ||
julia> tx, ty = multiskipmissing(x, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to copy the output, which is interesting for users BTW. Also add a line break before (basically, just copy/paste the real output).
src/Missings.jl
Outdated
``` | ||
julia> x = [1, 2, missing, 4]; y = [1, 2, 3, missing]; | ||
julia> tx, ty = multiskipmissing(x, y) | ||
julia> for t in tx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably simpler to show collect(t)
?
I responded to comments. I also ported the Ideally the type of
Then we use And we can dispatch on the elements of the |
I'm not sure what prevents you from doing this. Can you show what you have tried? |
I have updated this PR with a new implementation. Now the Additionally, I have updated the type signatures for the optimized |
Project.toml
Outdated
DataAPI = "1.1" | ||
julia = "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this.
src/Missings.jl
Outdated
|
||
# Returns nothing when the input contains only missing values, and Some(x) otherwise | ||
@noinline function Base.mapreduce_impl(f, op, itr::SkipMissings{<:AbstractArray}, | ||
ifirst::Integer, ilast::Integer, blksize::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix indentation.
src/Missings.jl
Outdated
""" | ||
skipmissings(args...) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Missings.jl
Outdated
return ntuple(i -> SkipMissings{typeof(z), i}(z, i), length(args)) | ||
end | ||
|
||
struct SkipMissings{C, I} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct SkipMissings{C, I} | |
struct SkipMissings{C} |
Maybe also note this Z
?
src/Missings.jl
Outdated
Base._mapreduce(f, op, Base.IndexStyle(_inner_itr(itr)), Base.eltype(itr.z) <: Tuple{Vararg{Union{Missing, Any}}} ? | ||
itr : _inner_itr(itr)) | ||
|
||
function Base._mapreduce(f, op, ::IndexLinear, itr::SkipMissings{C, I}) where C <: ZipofArrays where I |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function Base._mapreduce(f, op, ::IndexLinear, itr::SkipMissings{C, I}) where C <: ZipofArrays where I | |
function Base._mapreduce(f, op, ::IndexLinear, itr::SkipMissings{C}) where C <: ZipofArrays |
src/Missings.jl
Outdated
Base.mapreduce_impl(f, op, A, ifirst, ilast, Base.pairwise_blocksize(f, op)) | ||
|
||
# Returns nothing when the input contains only missing values, and Some(x) otherwise | ||
@noinline function Base.mapreduce_impl(f, op, itr::SkipMissings{C, I}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noinline function Base.mapreduce_impl(f, op, itr::SkipMissings{C, I}, | |
@noinline function Base.mapreduce_impl(f, op, itr::SkipMissings{C}, |
src/Missings.jl
Outdated
|
||
# Returns nothing when the input contains only missing values, and Some(x) otherwise | ||
@noinline function Base.mapreduce_impl(f, op, itr::SkipMissings{C, I}, | ||
ifirst::Integer, ilast::Integer, blksize::Int) where C <: ZipofArrays where I |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ifirst::Integer, ilast::Integer, blksize::Int) where C <: ZipofArrays where I | |
ifirst::Integer, ilast::Integer, blksize::Int) where C <: ZipofArrays |
test/runtests.jl
Outdated
@@ -75,6 +75,20 @@ struct CubeRooter end | |||
@test collect(x) == [1, 2, 4] | |||
@test collect(x) isa Vector{Int} | |||
|
|||
x = [1, 2, missing, 4] | |||
y = [1, 2, 3, missing] | |||
z = [missing, missing, 3, 4] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
z = [missing, missing, 3, 4] | |
z = [missing, missing, 3.1, 4.5] |
And then test the different eltypes.
Thanks for the feedback! Two issues: First, I don't understand how we can get rid of the Second, we don't have inference when compared to
I'm not sure how to fix it. Is the issue with my |
OK, I see. The problem is likely that Something that needs to be checked is what happens when you iterate over all iterators returned by |
Thanks for the suggestions, and sorry for taking so long to get back to this. i have added the index of the relevant vector as a type parameter and added lots of type asserts. Still no luck. Let me know if you have any new ideas. Otherwise I suppose we can iterate the vector separately. But that might have it's own performance drawbacks. I understand your other point and will work on it soon. |
src/Missings.jl
Outdated
|
||
function Base.iterate(itr::SkipMissings, state = 1) | ||
y = iterate(itr.z, ntuple(_ -> state, _width(itr))) | ||
function Base.iterate(itr::SkipMissings{Z, I, T}, state = 1)::Tuple{eltype(T), Union{Int, Nothing}} where {Z, I, T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@code_warntype sum(mx)
from your previous post seems to be OK if I remove this failing type assert:
function Base.iterate(itr::SkipMissings{Z, I, T}, state = 1)::Tuple{eltype(T), Union{Int, Nothing}} where {Z, I, T} | |
function Base.iterate(itr::SkipMissings{Z, I, T}, state = 1) where {Z, I, T} |
or if I fix it:
function Base.iterate(itr::SkipMissings{Z, I, T}, state = 1)::Tuple{eltype(T), Union{Int, Nothing}} where {Z, I, T} | |
function Base.iterate(itr::SkipMissings{Z, I, T}, state = 1)::Union{Tuple{eltype(T), Int}, Nothing} where {Z, I, T} |
Still failing for me. I'm on 1.3 still. I will download the 1.5 and see if that's the issue. |
The code correctly infers on 1.4-rc1! Now I will work on the I would assume we have a "less performant" release for 1.0-1.3 and then people in 1.4 just get the added benefit of having this code work better. |
I understand the To be fair, |
OK, cool. Then yes, let's not worry about performance on older releases, at least for now. Please remove any type asserts that are not needed on Julia 1.4 (and anyway they are not enough on older releases).
By There are two different paths to optimize: For the
Ah, indeed. We can probably extend that since the two objects barely need to be iterable and have the same length. Anyway, for now you can just test performance with a custom function, like this: function testperfzip(X, Y)
r = zero(eltype(x)) * zero(eltype(x))
for (x, y) in zip(skipmissings(X, Y)...)
r += x * y
end
r
end |
I benchmarked this a bit yesterday and it turns out it's enormously slow. I will spend some time next week uncovering the source of the allocations. I think the reason has to do with |
Good news! I think this is about on par with what I was expecting, after some re-factoring.
|
Cool. At least it looks like the overhead is reasonable. But have you tried with a larger vector? It's not clear whether the number of allocations increases with the length (it shouldn't). Also, it would be interesting to check with vectors which don't contain any missing values: there the overhead should be low or non-existent, and that's a first goal to reach. Can you push your latest changes? |
I think this is ready for another review. My performance improvement was a shift from using There is one big performance bottleneck and that is my use of
Which I then use to do the type signature of functions that need arrays as inputs, in particular
but |
Have you checked that |
I added a method for In fact, I think this PR is ready for a final review and then I would like to either
Both can be done in a separate PR. |
Nevermind. It was working and now it isn't. I'm not sure what changed and need to investigate further. |
@nalimilan I think I need you to take a look at it, or I should ask Discourse / Slack for help because I'm not understanding the lack of inference. I believe the issue is with The error has something to do with |
posted on discourse here |
Here is a gist for the new performance behavior. Performance still isn't great, I think. Since
|
I have responded to all comments above.
which is hacky but infers. On the other hand it isn't any faster than nested
|
src/Missings.jl
Outdated
@inbounds x_item = itr.x[ind] | ||
@inbounds while true | ||
x_item === missing || _anymissingindex(itr.others, ind) || break | ||
indnext = iterate(eix, indstate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call this inditr
as above? That's the same thing.
(BTW, it's a bit inconsistent to use x_item
but xitr
here and in the previous method. Better use the same convention regarding underscores everywhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -402,7 +406,7 @@ Base.mapreduce_impl(f, op, A::SkipMissingsofArrays, ifirst::Integer, ilast::Inte | |||
v = op(f(a1), f(a2)) | |||
@simd for i = i:ilast | |||
@inbounds ai = A[i] | |||
ai === missing || @inbounds !_anymissingindex(itr.others, i) || (v = op(v, f(ai))) | |||
ai === missing || @inbounds _anymissingindex(itr.others, i) || (v = op(v, f(ai))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was incorrect before and tests didn't spot it? Probably there was no test with an input large enough to use this branch (IIRC 16 elements is the threshold). Can you add one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a test for this.
I believe I successfully rebased! There were no conflicts. |
As you know, my rebasing skills are weak. Googling around it seems like a re-base is too complicated for this scenario because I have too many fixes to code I added during the course of this PR. This is an alternate method I got off slack. It works when I try it on a test branch.
|
I propose to do this, but then push |
I think the rebase is OK (except for that mysterious |
The rebase didn't work all the way, I still think it's because of all the re-factoring that I did mid-PR. You can tell because the |
I can confirm that |
But the |
It doesn't seem to be testing 1.4 yet, so it passes. The PR to have it test 1.4 is not in here because the rebase didn't work. If I |
But master doesn't appear to test 1.4. I've pulled locally and rebased against latest master, but indeed the branch seems to be really messy. I think the simplest solution would be to squash everything before rebasing, so that you only have to fix conflicts once. |
1a1d12e
to
f235617
Compare
Argh, since the |
Tests pass! |
@test isapprox(mapreduce(cos, *, collect(mx)), mapreduce(cos, *, mx)) | ||
if VERSION >= v"1.4.0-DEV" | ||
t = quote | ||
@inferred Union{Float64, Missing} mapreduce(cos, *, $mx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, thinking about it again, shouldn't this be just Float64
? The result cannot be missing
by definition, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the skipmissing
in Base we have
julia> @code_warntype sum(sx)
Variables
#self#::Core.Compiler.Const(sum, false)
a::Base.SkipMissing{Array{Union{Missing, Int64},1}}
Body::Union{Missing, Int64}
1 ─ %1 = Base.sum(Base.identity, a)::Union{Missing, Int64}
└── return %1
I don't know why the output works like that given eltype(sx) == Int
but it's a yet-to-be improved aspect of mapreduce
with skipmissing
. So it's fair that skipmissings
would work the same. You have a comment on some issue highlighting that it's not a high priority performance issue but I can't find it.
Thankfully small unions are fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I need to investigate this. It's somewhat ironic that the function designed to skip missings isn't inferred as such. :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I've realized @static if
works here instead of @eval
. I've committed that.
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Thanks! |
This PR adds functionality for
multiskipmissings
as discussed in Julia issue 30596 here.The motivation for this PR was noticing that
has no missing value support.
In order for this to work, you need to collect the non-missing matching values of
x
andy
. To do that I implement the functionmultiskipmissing
, which returns aTuple
of iterators.In the above example, each
tx
andty
areMultiSkipMissing
objects which havezip(x, y)
as a field. I then implementedskipmissing
from Base with some changed logic such that instead of checking if the array of interest has a missing element, I check if any elements of thei
th element ofzip(x, y)
are missing.I this PR is not complete. I still can't get
collect
to work, so the original goal is not yet met.